Abstract

To train a statistical spoken dialogue system (SDS) it is essential
that an accurate method for measuring task success is available.
To date training has relied on presenting a task to either
simulated or paid users and inferring the dialogue’s success by
observing whether this presented task was achieved or not. Our
aim however is to be able to learn from real users acting under
their own volition, in which case it is non-trivial to rate the
success as any prior knowledge of the task is simply unavailable.
User feedback may be utilised but has been found to be
inconsistent. Hence, here we present two neural network models that
evaluate a sequence of turn-level features to rate the success of
a dialogue. Importantly these models make no use of any prior
knowledge of the user’s task. The models are trained on
dialogues generated by a simulated user and the best model is then
used to train a policy on-line which is shown to perform at least
as well as a baseline system using prior knowledge of the user’s
task. We note that the models should also be of interest for
evaluating SDS and for monitoring a dialogue in rule-based SDS.
Index Terms: spoken dialogue systems, real users, reward
prediction, dialogue success classification, neural network

1. Introduction 

The dialogue manager is the core component of a spoken
dialogue system (SDS). It controls the interaction between the
system and the user, and is central to the overall quality of the user
experience. Casting an SDS as a partially observable Markov
decision process (POMDP) has been shown to be beneficial by
allowing the dialogue manager to be optimised to plan and act
under the uncertainty created by noisy speech recognition and
semantic decoding. The POMDP policy dictating the actions
taken by the SDS is trained in an episodic reinforcement
learning (RL) framework [5] whereby the agent receives a
reinforcement signal after each dialogue (episode) reflecting how
well it performed.

The goal of this paper is to demonstrate that an SDS can be
trained via interactions with real users where no direct
knowledge of the user’s goals is available at any point in the dialogue.
In all previous work the training of an SDS has been done with
either recruited subjects who are presented with a predefined
task to complete, or via simulated users
who randomly sample a goal over the specific ontology. In both
cases, the specific prior knowledge of the user’s goal is used to
calculate an objective measure (Obj) of whether the SDS
completed the task or not. In real world systems prior knowledge of
the user’s goal is simply not available, making any calculation
of an ‘objective’ measure nearly impossible.¹ Knowledge of
task success or failure is essential however for training an SDS.


One approach to this problem is to ask the user for feedback
at the completion of each dialogue. Yang et al. proposed
using collaborative filtering to infer user preferences given a set
of user-rated dialogues. However these ratings were very noisy,
which led to slow learning and poor policies. Also in
real-world systems it is not clear that a user would be cooperative
enough to provide feedback once the dialogue is completed.

Other research related to this problem includes the PARADISE
framework presented by Walker et al. for evaluating
a dialogue, where a linear function of task completion and
predefined dialogue costs was used for inferring user satisfaction.
However, as noted above, task completion is not directly
computable with real users and concerns relating to the theoretical
motivation of the model have also been raised. A framework
that does potentially enable the training of SDS with real
users was presented by Asri et al. [16], whereby a reward
function was learnt over a summary state space based on
dialogue data labelled by experts for task success. However, no
attempt was made to learn a policy with real users.

When training an SDS with paid users given specific tasks,
a common issue is that they are not motivated by a real information
need. As a consequence, they often² fail to follow exactly
the presented goal, resulting in Obj = failure even though the
SDS may have actually provided everything asked of it. In order
not to penalise the SDS by learning with such dialogues we have
previously also asked the user for their opinion of whether they
achieved the task goals, thereby obtaining a subjective success
rating (Subj). Then for policy learning, only those dialogues for
which Obj = Subj are used, the remainder being discarded.
With real users it is not possible however to calculate Obj since
the true goal of the user cannot be known. It is therefore essential
to find effective methods for computing rewards with real
users when the underlying task is unknown.

This paper investigates the use of neural networks to rate
task success automatically on-line by tracking the dialogue as
it evolves. In Section 2 two types of neural networks are
described, recurrent neural nets (RNNs) and convolutional neural
nets (CNNs), and the choice of features used to track the
dialogue is discussed along with the different types of predictions

1 We note that this is not a problem faced in training agents in many
common POMDP tasks: episode success in grid-worlds, games or
pole-balancing is well defined and easily computed. In comparison,
dialogue is an ill-posed problem for which it is non-trivial to classify the
success of an episode when there is no prior knowledge of the user’s
goal. There is even ambiguity as to what the label success means for a
dialogue. Our definition of success is based on the performance of the
dialogue agent, specifically whether it provided all of the information
asked of it for a domain entity satisfying the user’s constraints, e.g. the
phone number for a cheap restaurant in the north.

2 This case occurs in our experience at least 20% of the time.



the models are trained to produce. The experimental evaluation
is then presented in Section 3. Two performance metrics
are computed to evaluate the trained NN models: accuracy in
estimated task success and the root mean square error in
estimating the reward function. Performance in on-line learning
with (paid) users is then assessed and the effectiveness of the
neural network-based reward rating is demonstrated. Finally,
conclusions are presented in Section 4.

2. Neural network dialogue classification 

Two types of neural network (NN) models were investigated for
determining the final reward given to the reinforcement learning
agent. The structures of these models are described in
subsections 2.2, 2.3 and 2.4. First though we discuss their shared
feature inputs and training data.

2.1. Training data, dialogue features and generalisation 

The data used to train all models was collected by training several
Gaussian Process policies from scratch with an agenda-based
simulated user [9, 10]. The labels of success or failure
for each dialogue were computed based on an objective criterion
of whether or not the agent met the simulated user’s goals
generated at the start of each dialogue. The reinforcement signals
used during policy training were simply a reward of -1 at
each turn to promote speed, and a final reward of +20 at
completion if the dialogue was successful, otherwise 0. The return
(cumulative reward) R is therefore calculated as:

$R = 20 \times \mathbb{1}_{\text{success}} - N \qquad (1)$

where $N$ is the number of turns in a dialogue and
$\mathbb{1}_{\text{success}}$ is an indicator function for success.
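As a small worked illustration of Eqn. (1), the return can be computed as below. This is a sketch only: the function name `dialogue_return` and its keyword defaults are ours, chosen to mirror the +20 final reward and the -1 per-turn penalty stated above.

```python
def dialogue_return(success: bool, num_turns: int,
                    success_reward: int = 20, turn_penalty: int = 1) -> int:
    """Return R = 20 * 1_success - N for a single dialogue, as in Eqn. (1)."""
    if num_turns < 0:
        raise ValueError("num_turns must be non-negative")
    # int(success) plays the role of the indicator function 1_success.
    return success_reward * int(success) - turn_penalty * num_turns


# A successful and a failed dialogue, both lasting 6 turns.
print(dialogue_return(True, 6))   # 14
print(dialogue_return(False, 6))  # -6
```

Note that shorter successful dialogues receive a higher return, which is the intended incentive for speed.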

For all models, a domain-specific feature vector was
extracted at each turn³, consisting of the following concatenated
sections: one-hot encoding of the user’s top-ranked dialogue
act, the real-valued belief state vector formed by concatenating
the distributions over all goal, method and history variables,
one-hot encoding of the summary system action, and the
turn number. This is shown in Figure 1.
Figure 1: Feature vector f_t extracted at each turn t.


This form of feature vector was motivated by considering the
primary information a human would require to read a transcription
and rate the success of the dialogue. The inclusion of the
full belief vector, plus user and system actions, makes this
feature vector domain and system dependent.

The goal with these NN models is to enable policy learning
with real users by not requiring any prior knowledge of the
user’s goal. Their rating predictions are used directly to
provide the RL feedback to the dialogue agent. Hence they should
consider the information requested by the user over the whole
dialogue and ideally evaluate whether the policy provided
everything that was asked for or not. It is expected that by training
the NN models on data from the simulated users evaluated by
the objective measure, they will generalise to be able to provide
this ideal rating when assessing dialogues with real users whose
goals are not known (and hence the objective assessment cannot
be calculated). The reason to expect the models to generalise in

3 Turn here means system + user exchange. 
Figure 2: Schematic of the two NN predictive models. An
unrolled view of the RNN (top) and CNN model (bottom). The
feature vectors extracted at turns t = 1, ..., N are labelled f_t.

this way is that the simulated users have predefined tasks and
inform the system meticulously about all of them. Hence, the
objective measure of task completion indicates exactly whether or
not the system provided the information requested of it.
Therefore by training on these supervised learning pairs of data
generated by the simulated user and ratings provided by the objective
measure, the resulting NN predictive model should be a good
detector of whether or not the system provided what the user
requested from it. This is the desired indicator of the system’s
behaviour and a good reinforcement signal for policy learning.

Dialogues of course vary in their total number of turns. By 
extracting this feature vector at each turn a variable size set is 
obtained for each dialogue. The two NN models we investigate 
both make a single prediction for the whole dialogue, but do so 
in different ways, in particular with respect to how they handle 
this variable length sequence. 

2.2. Recurrent neural network model 

The recurrent neural network (RNN) model [20] is a subclass
of neural network that has feedback connections from one time
step to another. The ability to succinctly capture and retain
history information makes it suitable for modelling sequential data
with temporal dependencies. It has been shown to be successful
in various natural language processing tasks such as
language modelling [21, 22, 23] and spoken language understanding
(SLU).

Here the RNN model is adopted to manage the variable
length of each dialogue by simply updating its hidden layer h_t
with the input feature vector f_t at each turn t. Once the
dialogue ends the hidden layer is then connected to an output layer
to make a single prediction for the whole dialogue as depicted in
the top half of Figure 2.
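The turn-by-turn update described above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the implementation used in the experiments: we assume an Elman-style recurrence with a sigmoid hidden layer and a sigmoid output for the binary case, and all names (`rnn_predict`, `W_in`, `W_rec`, `w_out`) and the random weights are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def rnn_predict(features, W_in, W_rec, b_h, w_out, b_out):
    """Fold a variable-length sequence of turn features into one prediction.

    features: array of shape (N, F), one row f_t per turn.
    Returns a scalar in (0, 1), read as the probability of success.
    """
    h = np.zeros(W_rec.shape[0])          # initial hidden state
    for f_t in features:                  # one update per turn
        h = sigmoid(W_in @ f_t + W_rec @ h + b_h)
    # Only the hidden state after the final turn reaches the output layer.
    return sigmoid(w_out @ h + b_out)


# Untrained (random) weights, purely to show that dialogues of different
# lengths map to a single scalar each.
rng = np.random.default_rng(0)
F, H = 617, 300
W_in = 0.01 * rng.normal(size=(H, F))
W_rec = 0.01 * rng.normal(size=(H, H))
w_out = 0.01 * rng.normal(size=H)
for n_turns in (3, 12):
    dialogue = rng.random(size=(n_turns, F))
    print(n_turns, rnn_predict(dialogue, W_in, W_rec, np.zeros(H), w_out, 0.0))
```

The same weights are reused at every turn, which is what lets a fixed-size model handle dialogues of any length.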

2.3. Convolutional neural network model 

Also investigated was a convolutional neural network (CNN)
[25] which has been successfully used for image classification
[26] and on sequential modelling problems such as sentiment
labelling of sentences [27]. Here the CNN makes predictions by
considering the whole dialogue as a matrix formed by appending
turn-based feature vectors. On completion of the dialogue,
a convolutional filter of size (F, W), where F is the turn-based
feature dimension and W is a width across time, is applied in a
narrow convolution across the dialogue matrix. Multiple filters
are used, each of which creates its own feature map. A max-pooling
operation then reduces each of the feature map vectors
to a scalar. Finally, the resulting scalars are concatenated and
fed into a standard multi-layer perceptron (MLP), which may
consist of multiple layers. This process is shown in the bottom
half of Figure 2 where 4 feature maps are employed.

For the CNN, the mapping of the variable size input to a
fixed size is provided by the pooling operation applied to the
feature map outputs. The dialogue matrix is padded with W - 1
zero vectors on each side to allow a narrow convolution to
always be performed (even if the dialogue has only 1 turn).
Importantly this also allows the convolutional filter to move across
time (turns) and consider turn sequences of differing lengths.
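The padding, convolution and max-pooling steps can be sketched as follows. This is a hedged NumPy illustration rather than the original implementation: the function name `cnn_pooled_features` is ours, the weights are random, and we apply no non-linearity to the feature maps because the text does not specify one.

```python
import numpy as np


def cnn_pooled_features(dialogue, filters, bias):
    """Map a dialogue matrix of any length to a fixed-size vector.

    dialogue: (F, N) matrix whose columns are the turn features f_t.
    filters:  (K, F, W) bank of K filters spanning all F features, W turns.
    bias:     (K,) one bias per filter.
    Returns a (K,) vector: the max over time of each feature map.
    """
    K, F, W = filters.shape
    pad = np.zeros((F, W - 1))
    padded = np.concatenate([pad, dialogue, pad], axis=1)  # (F, N + 2W - 2)
    n_positions = padded.shape[1] - W + 1                  # N + W - 1
    feature_maps = np.empty((K, n_positions))
    for j in range(n_positions):
        window = padded[:, j:j + W]                        # (F, W) slice in time
        # Inner product of every filter with the current window.
        feature_maps[:, j] = np.tensordot(
            filters, window, axes=([1, 2], [0, 1])) + bias
    return feature_maps.max(axis=1)                        # max-pool over time


rng = np.random.default_rng(0)
F, W, K = 617, 30, 50
filters = 0.01 * rng.normal(size=(K, F, W))
for n_turns in (1, 5):
    pooled = cnn_pooled_features(rng.random(size=(F, n_turns)),
                                 filters, np.zeros(K))
    print(n_turns, pooled.shape)
```

Because of the padding, even a one-turn dialogue yields W feature-map positions, so the pooled vector always has K entries and can be passed to the MLP.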

2.4. RNN & CNN shared output layer 

The RNN and CNN models share the same network structure in
their final layer, and this structure is determined by the choice
of supervised training targets, of which three types were
considered, all derived from the described data.

1) In the first case the NN models are classifiers which are
trained to predict the Obj success or failure label for each
dialogue. The targets are {0,1} and the final layer of the NN
models outputs a scalar through a sigmoid activation function
and is trained with a cross-entropy loss. The output of this
network is a probability p that the dialogue is a success, and the
hard class label predicted by the model is taken as 1 if p > 0.5,
else 0. This hard label is used to determine whether to give a
final reward of +20 during policy learning, as per Eqn. (1).

In the other two cases, given that our goal is to provide the
final RL reward for policy learning, we also investigate
predicting this reward directly.

2) The second case is a multiclass classification problem
where the class labels are integers representing the possible
returns for the whole dialogue. The number of different returns
possible with Eqn. (1) is constrained by setting a maximum
number of allowable turns for a dialogue. A softmax activation
is used in the final layer of the NN models with a cross-entropy
loss. The one-hot encodings of the targets are convolved
with a discrete Gaussian kernel in order to smooth the target
distributions and reduce the magnitude of the return prediction errors.

3) The third case is a regression problem with the actual
return value used as the training target. The final layer of the NN
models has no non-linearity (activation) and the whole model
is trained with a mean-square-error (MSE) loss function.
During policy learning with cases 2 & 3 a per-turn penalty of 0
would be used, since these models predict the return rather than
the final reward and so implicitly include the total number of
turns penalty in the predicted return.
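The targets for cases 1 and 2 can be illustrated as follows. This is a sketch under stated assumptions: we assume a maximum of 30 turns and at least one turn per dialogue, so that the possible returns are the integers from -30 to 19; the kernel width `sigma` and the renormalisation at the edges of the class range are our choices, as the text does not give them. The names `final_reward` and `smoothed_return_target` are hypothetical.

```python
import numpy as np

MAX_TURNS = 30
SUCCESS_REWARD = 20
MIN_RETURN = -MAX_TURNS                    # failed dialogue of maximum length
MAX_RETURN = SUCCESS_REWARD - 1            # successful dialogue of one turn
N_CLASSES = MAX_RETURN - MIN_RETURN + 1    # number of integer return classes


def final_reward(p_success, threshold=0.5):
    """Case 1: turn the classifier probability into the final reward."""
    return SUCCESS_REWARD if p_success > threshold else 0


def smoothed_return_target(ret, sigma=1.0):
    """Case 2: one-hot return target smoothed by a discrete Gaussian.

    Convolving a one-hot vector with a Gaussian kernel places a Gaussian
    bump on the true class; we renormalise so the target sums to one.
    """
    classes = np.arange(MIN_RETURN, MAX_RETURN + 1)
    weights = np.exp(-0.5 * ((classes - ret) / sigma) ** 2)
    return weights / weights.sum()


print(N_CLASSES)
print(final_reward(0.83), final_reward(0.41))
target = smoothed_return_target(14)
print(classes_peak := int(np.argmax(target)) + MIN_RETURN)
```

With a smoothed target, predicting a return one class away from the truth is penalised less than predicting one far away, which a plain one-hot target would not distinguish.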

3. Experiments 

3.1. Domain and shared SDS components 

In all experiments the Cambridge restaurant domain was used,
which consists of approximately 150 venues each having 6
attributes (slots) of which 3 can be used by the system to constrain
the search and the remaining 3 are informable properties once a
database entity has been found.


The shared core components of the SDS used over all
experiments were a domain independent ASR, a confusion network
(CNet) semantic input decoder [28], the BUDS [19] belief
state tracker that factorises the dialogue state using a dynamic
Bayesian network and a template-based NLG of the system’s
semantic actions. All policies are trained by GP-SARSA and
the summary action space contains 20 actions.

With this ontology, the number of elements in each of the
four segments of the feature vector used by the NN models were
21, 575, 20, 1 respectively for the user act, full belief state,
system act and turn number. This resulted in a vector of F = 617
components at each turn. The turn number was expressed as a
percentage of the maximum number of allowed turns, here 30.
The one-hot user dialogue act encoding was formed by taking
only the most likely user act estimated by the CNet decoder.
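Assembling the per-turn feature vector from these four segments can be sketched as below. The segment sizes are those given above; everything else is an assumption for illustration, including the function name `turn_features`, the integer act indices, the uniform placeholder belief vector, and the choice to store the turn number as a fraction of the maximum rather than as a value out of 100.

```python
import numpy as np

N_USER_ACTS, N_BELIEF, N_SYS_ACTS, MAX_TURNS = 21, 575, 20, 30


def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v


def turn_features(user_act_id, belief, sys_act_id, turn):
    """Concatenate the four segments into one vector of 617 components."""
    belief = np.asarray(belief, dtype=float)
    if belief.shape != (N_BELIEF,):
        raise ValueError(f"belief must have shape ({N_BELIEF},)")
    return np.concatenate([
        one_hot(user_act_id, N_USER_ACTS),   # top-ranked user dialogue act
        belief,                              # full belief state
        one_hot(sys_act_id, N_SYS_ACTS),     # summary system action
        np.array([turn / MAX_TURNS]),        # normalised turn number
    ])


# Placeholder belief state, only to show the resulting dimensionality.
f_t = turn_features(3, np.full(N_BELIEF, 1.0 / N_BELIEF), 7, 15)
print(f_t.shape)  # (617,)
```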

3.2. Results: Neural network training 

In this section, results of training the two NN models⁴ on the
simulated user [9] dialogues scored by the Obj measure are
presented. Two training sets were used, consisting of 18K and 1K
dialogues. In all cases a separate validation set consisting of
1K dialogues was used for controlling overfitting. Training and
validation sets were approximately balanced regarding objective
success/failure labels and collected at a 15% semantic error
rate (SER). Prediction results are shown in Figure 3 on two test
sets; testA: 1K dialogues, balanced regarding objective labels,
at 15% SER and testB: 12K dialogues, containing 3 GP policies
trained from scratch on 1000 dialogues, collected at SERs of
0%, 15%, 30% and 45%, taken as the data occurred (i.e. with no
balancing regarding labels).

We used three different targets (cost functions) as described
in Section 2.4 to train both the RNN and CNN models. Eqn. (1)
was used to calculate the return from the binary success
classification (case 1 in Section 2.4); for cases 2 and 3 the success label was
inferred from Eqn. (1). The results are depicted in Figure 3
where the left y-axis is the success classification accuracy (bar
plot), and the right y-axis is the root-mean-square-error (RMSE)
of the return (scatter plot).
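The two metrics can be sketched as follows. The text does not spell out how the success label is inferred from a predicted return, so the rule below is our assumption: rearranging Eqn. (1) gives R + N = 20 for success and 0 for failure, and we threshold R + N at the midpoint. The function names and example numbers are hypothetical.

```python
import numpy as np


def success_from_return(returns, num_turns, success_reward=20):
    """Infer success labels from returns using Eqn. (1) rearranged.

    Assumed rule: R + N is 20 for success and 0 for failure, so a
    predicted return is labelled a success when R + N exceeds 10.
    """
    returns = np.asarray(returns, dtype=float)
    num_turns = np.asarray(num_turns, dtype=float)
    return (returns + num_turns) > success_reward / 2.0


def accuracy(pred_labels, true_labels):
    return float(np.mean(np.asarray(pred_labels) == np.asarray(true_labels)))


def rmse(pred_returns, true_returns):
    diff = np.asarray(pred_returns, float) - np.asarray(true_returns, float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Three made-up dialogues: true returns, their lengths, and predictions.
true_returns = [14, -6, 10]
turns = [6, 6, 10]
pred_returns = [12.5, -4.0, -1.0]

true_labels = success_from_return(true_returns, turns)
pred_labels = success_from_return(pred_returns, turns)
print(accuracy(pred_labels, true_labels), rmse(pred_returns, true_returns))
```

The number of turns is needed here because, with a 30-turn limit, the return ranges for successful and failed dialogues overlap, so the return alone does not determine the label.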

We see that the RNN outperformed the CNN in most cases.
When using the large training set (18K, sub-figures 1 & 3) all
models obtained over 93% success label accuracy while the
RNN more accurately estimated the return, getting within
approximately ±3 of the objective return targets on testA and within
approximately ±5 on testB. Without a simulated user it may not be
possible to access 18K labelled training dialogues so results are
also presented when training the models (with exactly the same
structures) on only 1K dialogues. Sub-figures 2 & 4 show that the
models are reasonably robust to this large reduction in the amount
of training data, with the binary classification models being the most
accurate and again the RNN outperforming the CNN.

These results give confidence that the NN models, sequentially
evaluating turn-level features, are able to serve as good
dialogue success detectors. The results on set testB also show
that the models can perform well in environments with varying
error rates as would be encountered in real operating
environments.


4 All NN models were implemented using the Theano library [29].
The RNN hidden layer used 300 units with sigmoid activations
for all cases. The CNN created 50 feature maps with filters of width
W = 30, and a 2-layer FFNN where the size of the 1st layer was 300
in case 2 and 50 otherwise. Stochastic gradient descent (per dialogue)
was used for training.
Figure 3: NN model training results: Prediction of RNN and CNN models trained on 18K and 1K dialogues and tested on sets testA
and testB (see text). Results of success/failure label accuracy (left axis) are represented as bars, and RMSE (right axis) as scatters.


3.3. Results: On-line policy training with the RNN model 

Based on the above results, the binary RNN classification
model was selected for training policies on-line. Two systems
were trained on-line by users recruited via Amazon Mechanical
Turk⁵. Firstly, a baseline system was trained which used
knowledge of the set tasks to compute the reward as described
in Section 1, and secondly a system was trained using only the
RNN to compute the reward signal. Three policies were trained
for each system, then averaged to reduce noise. Learning began
from a random policy in all cases.

Figure 4 shows the on-line learning curve of the reward and
number of turns when training the systems with 500 dialogues.
For both plots, the moving average was calculated using a
window of 100 dialogues and each result was the average of the
three policies in order to reduce noise. It can be seen that the
RNN system was able to learn at least as good a policy as the
baseline system. Further, the baseline system actually required
approximately 850 dialogues (due to discarding cases where Obj ≠ Subj),
while the RNN system used every dialogue and was therefore
more efficient and less costly.
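The smoothing used for such learning curves can be sketched as below. This is an illustration with synthetic rewards, not the experimental data: the function names are ours, and computing the standard error across the three policies at each point is our assumption about how the error bands would be obtained.

```python
import numpy as np


def moving_average(values, window=100):
    """Mean of each run of `window` consecutive dialogues."""
    values = np.asarray(values, dtype=float)
    if values.size < window:
        raise ValueError("need at least `window` values")
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


def averaged_learning_curve(per_policy_rewards, window=100):
    """Average the smoothed curves of several policies.

    Returns the mean curve and the standard error across policies.
    """
    curves = np.stack([moving_average(r, window) for r in per_policy_rewards])
    mean = curves.mean(axis=0)
    stderr = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, stderr


# Synthetic rewards for three policies trained on 500 dialogues each.
rng = np.random.default_rng(0)
rewards = [rng.integers(-30, 20, size=500) for _ in range(3)]
mean, stderr = averaged_learning_curve(rewards)
print(mean.shape, stderr.shape)  # (401,) (401,)
```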

In order to evaluate the resulting policies, we collected a
further 600 dialogues, turning off policy learning and asking
the Mechanical Turkers to rate, in addition to Subj, the quality
of the dialogue by answering the question “Do you think this
dialogue was successful?” on a 6-point Likert scale. Each of the
3 policies trained for the baseline and RNN systems received
100 dialogues and the average quality rating (interpreted as a
number between 0 and 5) is shown in Table 1 along with one
standard error. We report only the quality and Subj since the
Obj can be misleading due to Turkers not explicitly following
the task, as highlighted in Section 1. The results indicate that
the RNN dialogue success classifier was able to train a policy
at least as well as the baseline system even though the baseline
was trained via direct use of the prior knowledge of the user’s
goal and selected only dialogues where Obj = Subj to learn from.
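For reference, a mean with one standard error can be computed from a set of ratings as follows. This is a generic sketch with made-up ratings; the function name is ours, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np


def mean_and_standard_error(ratings):
    """Mean rating and its standard error (sample std / sqrt(n))."""
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.size
    if n < 2:
        raise ValueError("need at least two ratings")
    return float(ratings.mean()), float(ratings.std(ddof=1) / np.sqrt(n))


# Made-up 0-5 quality ratings, not the collected data.
mean, stderr = mean_and_standard_error([4, 5, 3, 4, 2, 5, 4, 4])
print(f"{mean:.2f} ± {stderr:.3f}")
```

The same function applies to the binary Subj rating if successes are coded as 1 and failures as 0.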

Table 1: Subjective evaluations of the trained baseline and RNN
policies. Quality: 6-point Likert scale, Subj: binary rating.

                 baseline        RNN
Quality (0-5)    3.77 ± 0.087    3.94 ± 0.068
Subj (%)         84.9 ± 2.2      89.5 ± 1.7


Figure 4: Learning curve with reward and number of turns
during on-line policy optimisation. The baseline system (black
line) updates the policy only when the Subj and Obj measures
agree. The green line shows training under the RNN dialogue
success predictor. Yellow and blue lines are standard errors.


4. Conclusions 

This paper has investigated the use of neural networks for
rating success in a spoken dialogue system. Both RNNs and CNNs
were shown to be capable of good performance when substantial
training data is available, but RNNs were more robust when
training data was limited. When compared to a baseline (which
used prior knowledge of the user’s goal) for on-line policy
learning with real users, the RNN delivered slightly improved
performance, suggesting that this approach does provide a way of
training real-world systems on-line with users whose goals are
unknown.

Current work is focused on investigating less domain-specific
features, the dependence on the simulated user, transferring
the RNN models to new domains, and using them for
reward shaping to speed up policy learning. We note finally
that the models should also be helpful for rule-based SDS to
adjust behaviour or know when to hand control from the
computer agent to a human to retrieve a failing dialogue, and for
evaluation of SDS generally.


5 Although our motivation is to train with real users and the NN models
we have introduced now enable this, we are restricted here to using
Mechanical Turkers since we do not have an actual service or product
to attract real users to.